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Abstract 

The  requirement  for  large  labelled  training 
corpora  is  widely  recognized  as  a  key  bot¬ 
tleneck  in  the  use  of  learning  algorithms  for 
information  extraction.  We  present  Tplex, 
a  semi-supervised  learning  algorithm  for  in¬ 
formation  extraction  that  can  acquire  extrac¬ 
tion  patterns  from  a  small  amount  of  labelled 
text  in  conjunction  with  a  large  amount  of  un¬ 
labelled  text.  Compared  to  previous  work, 
Tplex  has  two  novel  features.  First,  the  al¬ 
gorithm  does  not  require  redundancy  in  the 
fragments  to  be  extracted,  but  only  redundancy 
of  the  extraction  patterns  themselves.  Second, 
most  bootstrapping  methods  identify  the  high¬ 
est  quality  fragments  in  the  unlabelled  data  and 
then  assume  that  they  are  as  reliable  as  man¬ 
ually  labelled  data  in  subsequent  iterations. 

In  contrast,  Tplex  ’s  scoring  mechanism  pre¬ 
vents  errors  from  snowballing  by  recording 
the  reliability  of  fragments  extracted  from  un¬ 
labelled  data.  Our  experiments  with  several 
benchmarks  demonstrate  that  Tplex  is  usu¬ 
ally  competitive  with  various  fully-supervised 
algorithms  when  very  little  labelled  training 
data  is  available. 

1  Introduction 

Information  extraction  is  a  form  of  shallow  text  anal¬ 
ysis  that  involves  identifying  domain-specific  frag¬ 
ments  within  natural  language  text.  Most  recent  re¬ 
search  has  focused  on  learning  algorithms  that  auto¬ 
matically  acquire  extraction  patterns  from  manually 
labelled  training  data  (e.g.,  (Riloff,  1993;  Califf  and 
Mooney,  1999;  Soderland,  1999;  Freitag  and  Kushm¬ 
erick,  2000;  Ciravegna,  2001;  Finn  and  Kushmerick, 
2004)).  This  training  data  takes  the  form  of  the  original 
text,  annotated  with  the  fragments  to  be  extracted. 

Due  to  the  expense  and  tedious  nature  of  this  la¬ 
belling  process,  it  is  widely  recognized  that  a  key  bot¬ 
tleneck  in  deploying  such  algorithms  is  the  need  to  cre¬ 
ate  a  sufficiently  large  training  corpus  for  each  new  do¬ 
main.  In  response  to  this  challenge,  many  researchers 
have  investigated  semi-supervised  learning  algorithms 
that  learn  from  a  (relatively  small)  set  of  labelled  texts 


in  conjunction  with  a  (relatively  large)  set  of  unlabelled 
texts  (e.g.,  (Riloff,  1996;  Brin,  1998;  Yangarber  et  al., 
2002)). 

In  this  paper,  we  present  Tplex,  a  semi-supervised 
algorithm  for  learning  information  extraction  patterns. 
The  key  idea  is  to  exploit  the  following  recursive  defini¬ 
tions:  good  patterns  extract  good  fragments,  and  good 
fragments  are  extracted  by  good  patterns.  To  opera¬ 
tionalize  this  recursive  definition,  we  initialize  the  pat¬ 
tern  and  fragment  scores  with  labelled  data,  and  then 
iterate  until  the  scores  have  converged. 

Most  prior  semi-supervised  approaches  to  informa¬ 
tion  extraction  assume  that  fragments  are  essentially 
named  entities,  so  that  there  will  be  many  occurrences 
of  any  given  fragment.  For  example,  for  the  task  of 
discovering  diseases  (“influenza”,  “Ebola”,  etc),  prior 
algorithms  assume  that  each  disease  will  be  mentioned 
many  times,  and  that  every  occurrence  of  such  a  dis¬ 
ease  in  unlabelled  text  should  be  extracted.  However, 
it  may  not  be  the  case  that  fragments  to  be  extracted 
occur  more  than  once  in  the  corpus,  or  that  every  oc¬ 
currence  of  a  labelled  fragment  should  be  extracted. 
For  example,  in  the  well-known  CMU  Seminars  cor¬ 
pus,  any  given  person  usually  gives  just  one  seminar, 
and  a  fragment  such  as  “3pm”  sometimes  indicates  the 
start  time,  other  occurrences  indicate  the  end  time,  and 
some  occurrence  should  not  be  extracted  at  all.  Rather 
than  relying  on  redundancy  of  the  fragments,  Tplex 
exploits  redundancy  of  the  learned  extraction  patterns. 

Tplex  is  a  transductive  algorithm  (Vapnik,  1998),  in 
that  the  goal  is  to  perform  extraction  from  a  given  un¬ 
labelled  corpus,  given  a  labelled  corpus.  This  is  in  con¬ 
trast  to  the  typical  machine  learning  framework,  where 
the  goal  is  a  set  of  extraction  patterns  (which  can  of 
course  then  be  applied  to  new  unlabelled  text).  As  a 
side-effect,  Tplex  does  generate  a  set  of  extraction 
patterns  which  may  be  a  useful  in  their  own  right,  de¬ 
pending  on  the  application. 

We  have  compared  Tplex  with  various  competitors 
on  a  variety  of  real-world  extraction  tasks.  We  have  ob¬ 
served  that  Tplex’s  performance  matches  or  exceeds 
these  in  several  benchmark  tasks.  The  remainder  of  this 
paper  is  organized  as  follows.  After  describing  related 
work  in  more  detail  (Sec.  2),  we  describe  the  Tplex 
algorithm  (Sec.  3).  We  then  discuss  a  series  of  exper¬ 
iments  to  compare  Tplex  with  various  supervised  al- 
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gorithms  (Sec.  4).  We  conclude  with  a  summary  of  ob¬ 
servations  made  during  the  evaluation,  and  a  discussion 
of  future  work  (Sec.  5). 

2  Related  work 

A  review  of  machine  learning  for  information  extrac¬ 
tion  is  beyond  the  scope  of  this  paper;  see  e.g.  (Cardie, 
1997;  Kushmerick  and  Thomas,  2003). 

A  number  of  researchers  have  previously  developed 
bootstrapping  or  semi-supervised  approaches  to  infor¬ 
mation  extraction,  named  entity  recognition,  and  re¬ 
lated  tasks  (Riloff,  1996;  Brin,  1998;  Riloff  and  Jones, 
1999;  Agichtein  et  al.,  2001;  Yangarber  et  al.,  2002; 
Stevenson  and  Greenwood,  2005;  Etzioni  et  al.,  2005). 

Several  approaches  for  learning  from  both  labeled 
and  unlabeled  data  have  been  proposed  (Yarowsky, 
1995;  Blum  and  Mitchell,  1998;  Collins  and  Singer, 
1999)  where  the  unlabeled  data  is  utilised  to  boost  the 
performance  of  the  algorithm.  In  (Collins  and  Singer, 
1999)  Collins  and  Singer  show  that  unlabeled  data  can 
be  used  to  reduce  the  level  of  supervision  required  for 
named  entity  classification.  However,  their  approach 
is  reliant  on  the  presence  of  redundancy  in  the  named 
entities  to  be  identified. 

Tplex  is  most  closely  related  to  the  Nomen  algo¬ 
rithm  (Yangarber  et  al.,  2002).  Nomen  has  a  very  sim¬ 
ple  iterative  structure:  at  each  step,  a  very  small  num¬ 
ber  of  high-quality  new  fragments  are  extracted,  which 
are  treated  in  the  next  step  as  equivalent  to  seeds  from 
the  labeled  documents.  Nomen  has  a  number  of  pa¬ 
rameters  which  must  be  carefully  tuned  to  ensure  that 
it  does  not  over-generalise.  Erroneous  additions  to  the 
set  of  trusted  fragments  can  lead  to  a  snowballing  of 
errors. 

Also,  Nomen  uses  a  binary  scoring  mechanism, 
which  works  well  in  dense  corpora  with  substantial  re¬ 
dundancy.  However,  many  information  extraction  tasks 
feature  sparse  corpora  with  little  or  no  redundancy.  We 
have  extended  Nomen  by  allowing  it  to  make  finer- 
grained  (as  opposed  to  binary)  scoring  decisions  at  each 
iteration.  Instead  of  definitively  assigning  a  position  to 
a  given  field,  we  calculate  the  likelihood  that  it  belongs 
to  the  field  over  multiple  iterations. 

3  The  Tplex  algorithm 

The  goal  of  the  algorithm  is  to  identify  the  members 
of  the  target  fields  within  unlabelled  texts  by  generaliz¬ 
ing  from  seed  examples  in  labelled  training  texts.  We 
achieve  this  by  generalizing  boundary  detecting  pat¬ 
terns  and  scoring  them  with  a  recursive  scoring  func¬ 
tion. 

As  shown  in  Fig.  1,  Tplex  bootstraps  the  learning 
process  from  a  seed  set  of  labelled  examples.  The  ex¬ 
amples  are  used  to  populate  initial  pattern  sets  for  each 
target  field,  with  patterns  that  match  the  start  and  end 
positions  of  the  seed  fragments.  Each  pattern  is  then 
generalised  to  produce  more  patterns,  which  are  in  turn 
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Figure  1 :  An  overview  of  Tplex.  A  key  idea  of  the  al¬ 
gorithm  is  the  following  recursive  scoring  method:  pat¬ 
tern  scores  are  a  function  of  the  scores  of  the  positions 
they  extract,  and  position  scores  are  a  function  of  the 
scores  of  the  patterns  that  extract  them. 


applied  to  the  corpus  in  order  to  identify  more  base  pat¬ 
terns.  This  process  iterates  until  no  more  patterns  can 
be  learned. 

Tplex  employs  a  recursive  scoring  metric  in  which 
good  patterns  reinforce  good  positions,  and  good  posi¬ 
tions  reinforce  good  patterns.  Specifically,  we  calculate 
confidence  scores  for  positions  and  patterns.  Our  scor¬ 
ing  mechanism  calculates  the  score  of  a  pattern  as  a 
function  of  the  scores  of  the  positions  that  it  matches, 
and  the  score  of  a  position  as  a  function  of  the  scores 
of  the  patterns  that  extract  it. 

Tplex  is  a  multi-field  extraction  algorithm  in  that  it 
extracts  multiple  fields  simultaneously.  By  doing  this, 
information  learned  for  one  field  can  be  used  to  con¬ 
strain  patterns  learned  for  others.  Specifically,  our  scor¬ 
ing  mechanism  ensures  that  if  a  pattern  scores  highly 
for  one  field,  its  score  for  all  other  fields  is  reduced. 

In  the  remainder  of  this  section,  we  describe  the  al¬ 
gorithm  by  formalizing  the  space  of  learned  patterns, 
and  then  describing  Tplex’s  scoring  mechanism. 

3.1  Boundary  detection  patterns 

Tplex  extracts  fragments  of  text  by  identifying  prob¬ 
able  fragment  start  and  end  positions,  which  are  then 
assembled  into  complete  fragments.  Tplex’s  patterns 
are  therefore  boundary  detectors  which  identify  one 
end  of  a  fragment  or  the  other.  Tplex  learns  patterns 
to  identify  the  start  and  end  of  target  occurrences  inde¬ 
pendently  of  each  other.  This  strategy  has  previously 
been  employed  successfully  (Freitag  and  Kushmerick, 
2000;  Ciravegna,  2001;  Yangarber  et  al.,  2002;  Finn 
and  Kushmerick,  2004). 

Tplex’s  boundary  detectors  are  similar  to  those 
learned  by  BWI  (Freitag  and  Kushmerick,  2000).  A 
boundary  detector  has  two  parts,  a  left  pattern  and  a 
right  pattern.  Each  of  these  patterns  is  a  sequence  of 
tokens,  where  each  is  either  a  literal  or  a  generalized 
token.  For  example,  the  boundary  detector 
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[will  be  <punc>]  [<caps>  <fname>] 

would  correctly  find  the  start  of  a  name  in  an  utterance 
such  as  “will  be:  Dr  Robert  Boyle”  and  “will  be,  San¬ 
dra  Frederick”,  but  it  will  fail  to  identify  the  start  of  the 
name  in  “will  be  Dr.  Robert  Boyle”.  The  boundary  de¬ 
tectors  that  find  the  beginnings  of  fragments  are  called 
the  pre-patterns,  and  the  detectors  that  find  the  ends  of 
fragments  are  called  the  post-patterns. 

3.2  Pattern  generation 

As  input,  Tplex  requires  a  set  of  tagged  seed  docu¬ 
ments  for  training,  and  an  untagged  corpus  for  learn¬ 
ing.  The  seed  documents  are  used  to  initialize  the 
pre-pattern  and  post-pattern  sets  for  each  of  the  target 
fields.  Within  the  seed  documents  each  occurrence  of 
a  fragment  belonging  to  any  of  the  target  categories  is 
surrounded  by  a  set  of  special  tags  that  denote  the  field 
to  which  it  belongs. 

The  algorithm  parses  the  seed  documents  and  iden¬ 
tifies  the  tagged  fragments  in  each  document.  It  then 
generates  patterns  for  the  start  and  end  positions  of 
each  fragment  based  on  the  surrounding  tokens. 

Each  pattern  varies  in  length  from  2  to  n  tokens.  For 
a  given  pattern  length  £,  the  patterns  can  then  over¬ 
lap  the  position  by  zero  to  £  tokens.  For  example, 
a  pre -pattern  of  length  four  with  an  overlap  of  one 
will  match  the  three  tokens  immediately  preceeding  the 
start  position  of  a  fragment,  and  the  one  token  immedi¬ 
ately  following  that  position.  In  this  way,  we  generate 
^"=2(*  +  1)  patterns  from  each  seed  position.  In  our 
experiments,  we  set  the  maximum  pattern  length  to  be 
n  =  4. 

Tplex  then  grows  these  initial  sets  for  each  field  by 
generalizing  the  initial  patterns  generated  for  each  seed 
position.  We  employ  eight  different  generalization  to¬ 
kens  when  generalizing  the  literal  tokens  of  the  initial 
patterns.  The  wildcard  token  <*>  matches  every  lit¬ 
eral.  The  second  type  of  generalization  is  <punc>, 
which  matches  punctuation  such  as  commas  and  peri¬ 
ods.  Similarly,  the  token  <caps>  matches  literals  with 
an  initial  capital  letter,  <num>  matches  a  sequence  of 
digits,  <alpha_num>  matches  a  literal  consisting  of 
letters  followed  by  digits,  and  <num_alpha>  matches 
a  literal  consisting  of  digits  followed  by  letters.  The  fi¬ 
nal  two  generalisations  are  <fname>  and  <lname>, 
which  match  literals  that  appear  in  a  list  of  first  and  last 
names  (respectively)  taken  from  US  Census  data. 

All  patterns  are  then  applied  to  the  entire  corpus,  in¬ 
cluding  the  seed  documents.  When  a  pattern  matches  a 
new  position,  the  tokens  at  that  position  are  converted 
into  a  maximally-specialized  pattern,  which  is  added  to 
the  pattern  set.  Patterns  are  repeatedly  generalized  un¬ 
til  only  one  literal  token  remains.  This  whole  process 
iterates  until  no  new  patterns  are  discovered.  We  do 
not  generalize  the  new  maximally- specialized  patterns 
discovered  in  the  unlabelled  data.  This  ensures  that  all 
patterns  are  closely  related  to  the  seed  data.  (We  exper¬ 
imented  with  generalizing  patterns  from  the  unlabelled 


data,  but  this  rapidly  leads  to  overgeneralization.) 

The  locations  in  the  corpus  where  the  patterns  match 
are  regarded  as  potential  target  positions.  Pre-patterns 
indicate  potential  start  positions  for  target  fragments 
while  post-patterns  indicate  end  positions.  When  all  of 
the  patterns  have  been  matched  against  the  corpus,  each 
field  will  have  a  corresponding  set  of  potential  start  and 
end  positions. 

3.3  Notation  &  problem  statement 

Positions  are  denoted  by  r,  and  patterns  are  denoted 
by  p.  Formally,  a  pattern  is  equivalent  to  the  set  of 
positions  that  it  extracts. The  notation  p  — ►  r  indicates 
that  pattern  p  matches  position  r.  Fields  are  denoted  by 
/,  and  F  is  the  set  of  all  fields. 

The  labelled  training  data  consists  of  a  set  of  po¬ 
sitions  R  =  {...,r, ...},  and  a  labelling  function 
T  :  R  — >  F  U  {AT}  for  each  such  position.  T(r)  =  f 
indicates  that  position  r  is  labelled  with  field  /  in  the 
training  data.  T(r)  =  X  means  that  r  is  not  labelled 
in  the  training  data  (i.e.  r  is  a  negative  example  for  all 
fields). 

The  unlabelled  test  data  consists  of  an  additional  set 
of  positions  U . 

Given  this  notation,  the  learning  task  can  be  stated 
concisely  as  follows:  extend  the  domain  of  T  to  U , 
i.e.  generalize  from  T(r)  for  r  £  R,  to  T(r)  for  r  £  U. 

3.4  Pattern  and  position  scoring 

When  the  patterns  and  positions  for  the  fields  have  been 
identified  we  must  score  them.  Below  we  will  de¬ 
scribe  in  detail  the  recursive  manner  in  which  we  define 
score^(r)  in  terms  of  score^(p),  and  vice  versa.  Given 
that  definition,  we  want  to  find  fixed-point  values  for 
score j(p)  and  scor e^(r).  To  achieve  this,  we  initialize 
the  scores,  and  then  iterate  through  the  scoring  process 
(i.e.  calculate  scores  at  step  t  +  1  from  scores  at  step  t). 
This  process  repeats  until  convergence. 

Initialization.  As  the  scores  of  the  patterns  and  posi¬ 
tions  of  a  field  are  recursively  dependant,  we  must  as¬ 
sign  initial  scores  to  one  or  the  other.  Initially  the  only 
elements  that  we  can  classify  with  certainty  are  the  seed 
fragments.  We  initialise  the  scoring  function  by  assign¬ 
ing  scores  to  the  positions  for  each  of  the  fields.  In  this 
way  it  is  then  possible  to  score  the  patterns  based  on 
these  initial  scores. 

From  the  labelled  training  data,  we  derive  the  prior 
probability  n(f)  that  a  randomly  selected  position  be¬ 
longs  to  field  f  £  F: 

n(f)  =  \{r£R\T(r)  =  f}\/\R\. 

Note  that  1  —  7 r(/)  is  simply  the  prior  probabil¬ 

ity  that  a  randomly  selected  position  should  not  be  ex¬ 
tracted  at  all;  typically  this  value  is  close  to  1. 

Given  the  priors  n(f),  we  score  each  potential  posi- 
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tion  r  in  field  /: 

(  7 r(/)  if  r  e  U, 

score®  (r)  =  <  1  if  r  £  f?  A  T(r)  =  /,  and 

[  0  if  r  e  R  A  T(r)  ^  /. 

The  first  case  handles  positions  in  the  unlabelled  docu¬ 
ments;  at  this  point  we  don’t  know  anything  about  them 
and  so  fall  back  to  the  prior  probabilities.  The  second 
and  third  cases  handle  positions  in  the  seed  documents, 
for  which  we  have  complete  information. 

Iteration.  After  initializing  the  scores  of  the  posi¬ 
tions,  we  begin  the  iterative  process  of  scoring  the 
patterns  and  the  positions.  To  compute  the  score  of 
a  pattern  p  for  field  /  we  compute  a  positive  score, 
pos/(p);  a  negative  score,  neg/(p);  and  an  unknown 
score,  unk(p).  pos  f(p)  can  be  thought  of  as  a  measure 
of  the  benefit  of  p  to  /,  while  neg f(p)  measures  the 
harm  of  p  to  /,  and  unk(p)  measures  the  uncertainty 
about  the  field  with  which  p  is  associated. 

These  quantities  are  defined  as  follows:  pos/(p)  is 
the  average  score  for  field  f  of  positions  extracted  by 
p.  We  first  compute: 


This  definition  penalizes  patterns  that  are  either  in¬ 
accurate  or  have  low  coverage. 

Finally,  we  complete  the  iterative  step  by  calculating 
a  revised  score  for  each  position: 
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score 


EP 


■(r) 
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if  r  G  R 
if  r  €  U, 


where  min  =  min ftP-,r  score/  fa)  and  max  = 

max  f  p^r  E]p^r  score  ^  fa),  are  used  to  normalize  the 
scores  to  ensure  that  the  scores  of  unlabelled  positions 
never  exceed  the  scores  of  labelled  positions.  The  first 
case  in  the  function  for  score j+1(r)  handles  positive 
and  negative  seeds  (i.e.  positions  in  labelled  texts),  the 
second  case  is  for  unlabelled  positions. 

We  iterate  this  procedure  until  the  scores  of  the  pat¬ 
terns  and  positions  converge.  Specifically,  we  stop 
when 


Ef  Ep  |score^(p)-score‘  1  (i>) | 2  +  \ 
y  ^  score)  (r)  —  score)-  1  (r)  | 2  J 


In  our  experiments,  we  fixed  6=1. 


pos f(p)  =  -=-  22  score}  fa), 

p  p — >r 

where  Zp  =  E/ Ep^r  score/(r)  *s  a  normalizing 
constant  to  ensure  that  j  pos f{p)  =  1. 

For  each  field  f  and  pattern  p,  neg  /  fa)  is  the  extent 
to  which  p  extracts  positions  whose  field  is  not  /: 


neg/ fa)  =  1  -  pos/fa). 

Finally,  unkfa)  measures  the  degree  to  which  p  ex¬ 
tract  positions  whose  field  is  unknown: 


unkfa) 


1 

I -fa  r}\ 


22  unk(r), 


where  unk(r)  measures  the  degree  to  which  position  r 
is  unknown.  To  be  completely  ignorant  of  a  position’s 
field  is  to  fall  back  on  the  prior  field  probabilities  n(f). 
Therefore,  we  calculate  unk(r)  by  computing  the  sum 
of  squared  differences  between  score^(r)  and  7r (/): 

unk(r)  =  1  —  ^SSD(r), 

Z j 

SSD(r)  =  22  (scorej(r)  —  7r(/))2  , 

/ 

Z  =  maxSSD(r). 


The  normalization  constant  Z  ensures  that  unk(r)  =  0 
for  the  position  r  whose  scores  are  the  most  different 
from  the  priors — ie,  r  is  the  “least  unknown”  position. 

For  each  field  /  and  pattern  p,  score^  (p)  is  defined 
in  terms  of  pos/(p),  neg/(p)  and  unk(p)  as  follows: 

score,  p  =  - -posclp) 

1  pos/(p)+neg/(p)+unk(p)  1 

P°s  f(p)2 
1+unk  (p) 


3.5  Position  filtering  &  fragment  identification 

Due  to  the  nature  of  the  pattern  generation  strategy, 
many  more  candidate  positions  will  be  identified  than 
there  are  targets  in  the  corpus.  Before  we  can  proceed 
with  matching  start  and  end  positions  to  form  frag¬ 
ments,  we  filter  the  positions  to  remove  the  weaker  can¬ 
didates. 

We  rank  all  of  positions  for  each  field  according  to 
their  score.  We  then  select  positions  with  a  score  above 
a  threshold  6  as  potential  positions.  In  this  way  we 
reduce  the  number  of  candidate  positions  from  tens  of 
thousands  to  a  few  hundred. 

The  next  step  in  the  process  is  to  identify  complete 
fragments  within  the  corpus  by  matching  pre-positions 
with  post-positions.  To  do  this  we  compute  the  length 
probabilities  for  the  fragments  of  field  /  based  on  the 
lengths  of  the  seed  fragments  of  /.  Suppose  that  posi¬ 
tion  ri  has  been  identified  as  a  possible  start  for  field  /, 
and  position  r2  has  been  identified  as  a  possible  field  / 
end,  and  let  Pf(t)  be  the  fraction  of  field  f  seed  frag¬ 
ments  with  length  £.  Then  the  fragment  e  =  (ri,r2)  is 
assigned  a  score 

score/(e)  =  score/(ri)  •  score/(r2)  •  P/(r2  —  r\  +  1). 

Despite  these  measures,  overlapping  fragments  still 
occur.  Since  the  correct  fragments  can  not  overlap,  we 
know  that  if  two  extracted  fragments  overlap,  at  least 
one  must  be  wrong.  We  resolve  overlapping  fragments 
by  calculating  the  set  of  non-overlapping  fragments 
that  maximises  the  total  score  while  also  accounting 
for  the  expected  rate  of  occurrence  of  fragments  from 
each  field  in  a  given  document. 

In  more  detail,  let  E  be  the  set  of  all  fragments  ex¬ 
tracted  from  some  particular  document  D.  We  are  in¬ 
terested  in  the  score  of  some  subset  G  C  E  of  P’s 


28 


fragments.  Let  score(G)  be  the  chance  that  G  is  the 
correct  set  of  fragments  for  D.  Assuming  that  the  cor¬ 
rectness  of  the  fragments  can  be  determined  indepen¬ 
dently  given  that  the  correct  number  of  fragments  have 
been  identified  for  each  field,  then  score (G)  can  be  de¬ 
fined  zero  if  3  (n,  n),  (S2,  La)  G  G  such  that  (si,  n) 
overlaps  ( S2,f2 ),  and  score(G)  =  J~[  f  score(G/)  oth¬ 
erwise,  where  G/  C  G  is  the  fragments  in  G  for 
field  /.  The  score  of  G/  =  {e1,  e2, . . .}  is  defined  as 
score(G/)  =  Pr(|G/|)-J~[j  score(eJ),  where  Pr(|G/|) 
is  the  fraction  of  training  documents  that  have  |  G/ 1  in¬ 
stances  of  field  /. 

It  is  infeasible  to  enumerate  all  subsets  G  C  E,  so 
we  perform  a  heuristic  search.  The  states  in  the  search 
space  are  pairs  of  the  form  (G,P),  where  G  is  a  list 
of  good  fragments  (i.e.  fragments  that  have  been  ac¬ 
cepted),  and  P  is  a  list  of  pending  fragments  (i.e.  frag¬ 
ments  that  haven’t  yet  been  accepted  or  rejected). 

The  search  starts  in  the  state  ({},P),  and  states  of 
the  form  (G,  {})  are  terminal  states.  The  children  of 
state  (G,  P)  are  all  ways  to  move  a  single  fragment 
from  P  to  G.  When  forming  a  child’s  pending  set,  the 
moved  fragment  along  with  all  fragments  that  it  over¬ 
laps  are  removed  (meaning  that  the  moved  fragment  is 
selected  and  all  fragments  with  which  it  overlaps  are 
rejected).  More  precisely,  the  children  of  state  (G,  P) 
are: 

/  ^Ql  pi ^  A  G'=Gu{e}  A  1 

I  P‘ ~{e'  £P  |  e'  doesn’t  overlap  e}  j 

The  search  proceeds  as  follows.  We  maintain  a  set 
S  of  the  best  K  non-terminal  states  that  are  to  be  ex¬ 
panded,  and  the  best  terminal  state  B  encountered  so 
far.  Initially,  S  contains  just  the  initial  state.  Then,  the 
children  of  each  state  in  S  are  generated.  If  there  are 
no  such  children,  the  search  halts  and  B  is  returned  as 
the  best  set  of  fragments.  Otherwise,  B  is  updated  if 
appropriate,  and  S  is  set  to  the  best  K  of  the  new  chil¬ 
dren.  Note  that  K  =  00  corresponds  to  an  exhaustive 
search.  In  our  experiments,  we  used  K  =  5. 

4  Experiments 

To  evaluate  the  performance  of  our  algorithm  we  con¬ 
ducted  experiments  with  four  widely  used  corpora:  the 
CMU  seminar  set,  the  Austin  jobs  set,  the  Reuters 
acquisition  set  [www.isi.edu/info-agents/RISE],  and  the 
MUC-7  named  entity  corpus  [www.ldc.upenn.edu]. 

We  randomly  partitioned  each  of  the  datasets  into 
two  evenly  sized  subsets.  We  then  used  these  as  la¬ 
beled  and  unlabeled  sets.  In  each  experiment  the  al¬ 
gorithm  was  presented  with  a  document  set  comprised 
of  the  test  set  and  a  randomly  selected  percentage  of 
the  documents  from  the  training  set.  For  example,  if 
an  experiment  involved  providing  the  algorithm  with  a 
5%  seed  set,  then  5%  of  the  documents  in  the  training 
set  (2.5%  of  the  documents  in  the  entire  dataset)  would 
be  selected  at  random  and  used  in  conjunction  with  the 
documents  in  the  test  set. 


For  each  training  set  size,  we  ran  five  iterations  with 
a  randomly  selected  subset  of  the  documents  used  for 
training.  Since  we  are  mainly  motivated  by  scenarios 
with  very  little  training  data,  we  varied  the  size  of  the 
training  set  from  1-10%  (1-16%  forMUC-7NE)  of  the 
available  documents.  Precision,  recall  and  FI  were  cal¬ 
culated  using  the  B  WI  (Freitag  and  Kushmerick,  2000) 
scorer.  We  used  all  occurrences  mode,  which  records 
a  match  in  the  case  where  we  extract  all  of  the  valid 
fragments  in  a  given  document,  but  we  get  no  credit  for 
partially  correct  extractions. 

We  compared  Tplex  to  BWI  (Freitag  and 
Kushmerick,  2000),  LP2  (Ciravegna,  2001),  Elie 
(Finn  and  Kushmerick,  2004),  and  an  approach 
based  on  conditional  random  fields  (Lafferty  et  al., 

2001) .  The  data  for  BWI  was  obtained  using 
the  TIES  implementation  [tcc.itc.it/research/textec/tools- 
resources/ties.html].  The  data  for  the  LP2  learning  curve 
was  obtained  from  (Ciravegna,  2003).  The  results  for 
Elie  were  generated  by  the  current  implementation 
[http://smi.ucd.ie/aidan/Software.html].  For  the  CRF  re¬ 
sults,  we  used  MALLET’s  SimpleTagger  (McCallum, 

2002) ,  with  each  token  encoded  with  a  set  of  binary 
features  (one  for  each  observed  literal,  as  well  as  the 
eight  token  generalizations). 

Our  results  in  Fig.  2  indicate  that  in  Acquisitions 
dataset,  our  algorithm  substantially  outperforms  the 
competitors  at  all  points  on  the  learning  curve.  For  the 
other  datasets,  the  results  are  mixed.  For  SA  and  Jobs, 
Tplex  is  the  second  best  algorithm  at  the  low  end  of 
the  learning  curve,  and  steadily  loses  ground  as  more 
labelled  data  is  available.  Tplex  is  the  least  accurate 
algorithm  for  the  MUC  data.  In  Sec.  5,  we  discuss  a 
variety  of  modifications  to  the  Tplex  algorithm  that 
we  anticipate  may  improve  its  performance. 

Finally,  the  graph  in  Fig.  3  compares  Tplex  for  the 
SA  dataset,  in  two  configurations:  with  a  combination 
of  labelled  and  unlabelled  documents  as  usual,  and  with 
only  labelled  documents.  In  both  instances  the  algo¬ 
rithm  was  given  the  same  seed  and  testing  documents. 
In  the  first  case  the  algorithm  learned  patterns  using 
both  the  labeled  and  unlabeled  documents.  However, 
in  the  second  case,  only  the  labeled  documents  were 
used  to  generate  the  patterns.  These  data  confirm  that 
Tplex  is  indeed  able  to  improve  performance  from  un¬ 
labelled  data. 

5  Discussion 

We  have  described  Tplex,  a  semi-supervised  algo¬ 
rithm  for  learning  information  extraction  patterns.  The 
key  idea  is  to  exploit  the  following  recursive  definition: 
good  patterns  are  those  that  extract  good  fragments, 
and  good  fragments  are  those  that  are  extracted  by  good 
patterns.  This  definition  allows  Tplex  to  perform  well 
with  very  little  training  data  in  domains  where  other  ap¬ 
proaches  that  assume  fragment  redundancy  would  fail. 
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Percentage  of  data  used  for  training 


Figure  2:  FI  averaged  across  all  fields,  for  the  Seminar 
(top).  Acquisitions  (second).  Jobs  (third)  and  MUC-NE 
(bottom)  corpora. 

Conclusions.  From  our  experiments  we  have  ob¬ 
served  that  our  algorithm  is  particularly  competitive 
in  scenarios  where  very  little  labelled  training  data  is 
available.  We  contend  that  this  is  a  result  of  our  algo¬ 
rithm’s  ability  to  use  the  unlabelled  test  data  to  validate 
the  patterns  learned  from  the  training  data. 

We  have  also  observed  that  the  number  of  fields  that 
are  being  extracted  in  the  given  domain  affects  the  per¬ 
formance  of  our  algorithm.  Tplex  extracts  all  fields 
simultaneously  and  uses  the  scores  from  each  of  the 


Percentage  of  data  used  for  training 


Figure  3:  FI  averaged  across  all  fields,  for  the  Semi¬ 
nar  dataset  trained  on  only  labeled  data  and  trained  on 
labeled  and  unlabeled  data 


patterns  that  extract  a  given  position  to  determine  the 
most  likely  field  for  that  position.  With  more  fields 
in  the  problem  domain  there  is  potentially  more  infor¬ 
mation  on  each  of  the  candidate  positions  to  constrain 
these  decisions. 

Future  work.  We  are  currently  extending  Tplex  in 
several  directions.  First,  position  filtering  is  currently 
performed  as  a  distinct  post-processing  step.  It  would 
be  more  elegant  (and  perhaps  more  effective)  to  incor¬ 
porate  the  filtering  heuristics  directly  into  the  position 
scoring  mechanism.  Second,  so  far  we  have  focused 
on  a  BWI-like  pattern  language,  but  we  speculate  that 
richer  patterns  permitting  (for  example)  optional  or  re¬ 
ordered  tokens  may  well  deliver  substantial  increases 
in  accuracy. 

We  are  also  exploring  ideas  for  semi-supervised 
learning  from  the  machine  learning  community. 
Specifically,  probabilistic  finite-state  methods  such 
hidden  Markov  models  and  conditional  random  fields 
have  been  shown  to  be  competitive  with  more  tradi¬ 
tional  pattern-based  approaches  to  information  extrac¬ 
tion  (Fuchun  and  McCallum,  2004),  and  these  methods 
can  exploit  the  Expectation  Maximization  algorithm  to 
learn  from  a  mixture  of  labelled  and  unlabelled  data 
(Lafferty  et  al.,  2004).  It  remains  to  be  seen  whether 
this  approach  would  be  effective  for  information  ex¬ 
traction. 

Another  possibility  is  to  explore  semi-supervised  ex¬ 
tensions  to  boosting  (d’Alche  Buc  et  al.,  2002).  Boost¬ 
ing  is  a  highly  effective  ensemble  learning  technique, 
and  Bwi  uses  boosting  to  tune  the  weights  of  the 
learned  patterns,  so  if  we  generalize  boosting  to  han¬ 
dle  unlabelled  data,  then  the  learned  weights  may  well 
be  more  effective  than  those  calculated  by  Tplex. 
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